Genome-Wide Association and RNA-Seq Analyses Reveal a Potential Candidate Gene Related to Oil Content in Soybean Seeds

Soybean is a crucial crop globally, serving as a significant source of unsaturated fatty acids and protein in the human diet. However, further enhancements are required for the related genes that regulate soybean oil synthesis. In this study, 155 soybean germplasms were cultivated under three different environmental conditions, followed by phenotypic identification and genome-wide association analysis using simplified sequencing data. Genome-wide association analysis was performed using SLAF-seq data. A total of 36 QTLs were significantly associated with oil content (−log10(p) > 3). Out of the 36 QTLs associated with oil content, 27 exhibited genetic overlap with previously reported QTLs related to oil traits. Further transcriptome sequencing was performed on extreme high–low oil soybean varieties. Combined with transcriptome expression data, 22 candidate genes were identified (|log2FC| ≥ 3). Further haplotype analysis of the potential candidate genes showed that three potential candidate genes had excellent haplotypes, including Glyma.03G186200, Glyma.09G099500, and Glyma.18G248900. The identified loci harboring beneficial alleles and candidate genes likely contribute significantly to the molecular network’s underlying marker-assisted selection (MAS) and oil content.


Introduction
Lipids are essential metabolic compounds in plants that play a pivotal role in plant growth [1].The oil content of seeds in different crops exhibits significant variation, with those in maize ranging from 3% to 7%, soybeans ranging from 16% to 23%, rapeseed ranging from 36% to 47%, and peanuts ranging from 40% to 50% [2 -5].Triacylglycerol serves as the predominant constituent of seed oil, comprising three molecules of long-chain fatty acids and one molecule of glycerol [6].
Soybean, a primary oil crop, synthesizes oil mainly in the endoplasmic reticulum.Using acetyl-CoA and glycerol-3-phosphate (G3P) as substrates for a series of reactions, this process is called the Kennedy pathway [7].Glycerol-3-phosphate acyltransferases and lysophosphatidic acid acyltransferase catalyze the continuous acylation of the sn-1 and sn-2 sites of G3P, resulting in the formation of phosphatidic acid [8].Moreover, phosphatidic acid (PA) undergoes dephosphorylation at the sn-3 position, catalyzed by phosphatidic acid phosphohydrolase (PAP), resulting in the formation of diacylglycerol (DAG) [9].The generated DAG is further acylated to TAG at the sn-3 position, catalyzed by diacylglycerol acyltransferase (DGAT) [10].Due to its instability in the endoplasmic reticulum, the TAG eventually undergoes sequestration within the seed as an oil body through interaction with specific oil body proteins [11].It was reported that diacylglycerol acyltransferase (DGAT1-2) plays a pivotal role as a key enzyme in catalyzing oil synthesis, thereby significantly enhancing the oil content in maize seeds [12].
With the advancement of omics technology, high-throughput transcriptome sequencing and genome-wide association analysis have become ubiquitous tools for investigating specific traits.Research reports that a total of 96,432 SNPs were identified with 203 soybean accessions.A total of 44 QTLs were identified to be significantly associated with protein, oil, and amino acid content, and the genes Glyma.11G015500 and Glyma.20G050300 were identified as novel candidates for protein and oil content, respectively [13].The oil content of 588 rapeseed materials was associated with 385,692 SNPs through genome-wide association analysis, resulting in the identification of 17 significant association sites; 13 SNPs were found to be located on chromosomes A3 (11 loci) and A1 (one locus), while 5 novel SNPs were discovered on chromosomes C5 (one locus) and C7 (four loci) [14].A genome-wide association analysis was performed on grain weight using 24,180 SNPs from 185 soybean materials, resulting in the identification of 34 SNPs significantly correlated with grain weight, including 19 newly discovered QTNs [15].Two varieties representing different contents of unsaturated fatty acids were selected from 314 soybean materials, and RNA-seq analysis was performed at three different developmental stages for these selected varieties.A total of 2080, 11,343, and 2230 DEGs were identified [16].Transcriptome analysis of seed embryos at 0 h, 12 h, and 48 h of imbibition revealed that brassinosteroids (BRs) can enhance seed germination by facilitating phosphorylation-mediated reactions to abscisic acid (ABA), while epigenetic modifications may serve as a crucial regulatory mechanism governing seed dormancy and germination [17].
To further elucidate the regulatory mechanism underlying soybean oil content, in this study, oil content of 155 soybean materials was determined, and genome-wide association analysis was performed.Furthermore, extreme high-low oil soybean materials were screened for transcriptome sequencing.The putative genes governing oil content were identified through the integration of GWAS and transcriptome analysis.

Statistical Analysis of Oil Content
A total of 155 soybean materials were utilized in this study, which were collected from three distinct locations in Heilongjiang province, namely, Nenjiang (124 • 44 ′ N, 48  04 ′ E) in 2023.The oil content of the 155 soybean materials exhibited regional variations, with the coefficient of variation ranging between 4.4% and 4.6% across different regions (Table S1).The phenotypic changes observed in Nenjiang, Beian, and the five connected lakes area ranged from 16.9% to 22.5%, 16.4% to 22.4%, and 16.6% to 22.3%, respectively (Figure 1).

Population Structure Analysis
In this study, we performed specific-locus amplified fragment sequencing on 155 soybean samples, and a total of 23,131 high-quality SNPs were obtained (MAF > 0.05, missing data < 10%).The SNPs obtained above were evenly distributed across the 20 chromosomes of soybean (Figure S1A).Principal component analysis (PCA) showed that there was an inflection point at PC3 (Figure S1B).The population exhibited no discernible stratification, while the first three principal components exerted a significant influence on the population structure (Figure S1C).The genetic relationships among 155 soybean germplasm were determined based on the SNP genotyping results (Figure S1D).

Genome-Wide Association Study
We analyzed the oil content in three environments using a compressed mixed linear model (CMLM) with 23,131 SNP markers across the genome through a genome-wide association study.The results showed that 36 QTLs were significantly associated with oil content (−log10(p) > 3) (Figure 2).The subsequent analysis revealed that stable QTLs that could be screened under multiple environmental conditions were mainly distributed on chromosomes 3, 4, 7, 9, 10, 11, 12, 13, 15, 18, and 19.Out of the 36 QTLs associated with oil content, 27 exhibited genetic overlap with previously reported QTLs related to oil traits (Table 1).Note: E1: Nenjiang, E2: Beian, E3: five connected lakes.

Functional Prediction of Candidate Genes by GWAS and RNA-Seq Analysis
To further identify candidate genes regulating oil content, this study conducted transcriptome sequencing on the soybean materials with extremely high and low oil contents.The results showed that 7774 upregulated and 1801 downregulated differentially expressed genes (DEGs) were identified (|log2FC| ≥ 1) (Figure S2).The obtained differential genes were subjected to KEGG enrichment analysis, revealing that a total of 1686 differential genes were significantly enriched in KEGG pathways.The DEGs were predominantly enriched in metabolic pathways (ko01100, p < 8.83 × 10 −8 ), starch and sucrose metabolism (ko00500, p < 0.0005), photosynthesis-antenna proteins (ko00196, p < 1.52 × 10 −9 ), and photosynthesis (ko00195, p < 1.45 × 10 −7 ) (Figure 3).This study explored novel candidate genes regulating oil content, identifying a total of 706 candidate genes from the 100 kb range surrounding the peak SNP (QTL) (Table S2).Combined with transcriptome expression data, 22 candidate genes were differentially expressed (|log2FC| ≥ 3) (Table 2).According to the functional annotation of the candidate gene, Glyma.20G208400exhibits homology with Arabidopsis AT1G62510.It was reported that overexpression of the aspartate aminotransferase gene can increase the amino acid content of seeds in rice [33].The gene Glyma.03G186200exhibits homology with Arabidopsis AT5G03530.Previous studies have demonstrated the role of the small GTPase ARL8B gene in mediating lipid droplet transformation [34].The gene Glyma.17G016900exhibits homology with the Arabidopsis gene AT3G06140.It was found that A U-box type E3 ubiquitin ligase plays a regulatory role in lipid accumulation [35].The above genes were found to have higher expression levels.In addition, other identified genes based on expression levels were also defined as candidate genes.Additionally, four differential genes were selected for real-time PCR, and the results showed that the results of qRT-PCR were basically consistent with the transcriptome data (Figure S3).

Gene-Based Association and Haplotype Analysis of Candidate Genes
In order to determine the sequence variation in candidate genes, association analysis was conducted using the SNPs of the candidate genes and phenotypic traits through GLM.Based on the association analysis, one SNP was identified in the UTR region in the Glyma.03G186200gene (−log10(p) > 2.5), where the oil content of the A allele was significantly higher than that of the G allele (Figure 4A).A total of eight single-nucleotide polymorphisms (SNPs) were identified in the upstream, UTR5, exonic, and UTR3 regions of the Glyma.09G099500gene across the three environments (−log10(p) > 2.5).The oil content associated with the C/T/T/T/T/G/C/A alleles was significantly higher than that of the T/A/C/G/C/T/A/G alleles (Figure 4B).One SNP was identified in the exonic region in the Glyma.18G248900gene in the three environments (−log10(p) > 2.5), and the oil content of C alleles in environments E1 and E3 was significantly higher than that of T alleles, while there was no significant difference in oil content between the C and T alleles in the E2 environment (Figure 4C).

Co-Expression Analysis of Transcription Factors and Candidate Genes
In this study, transcription factors were collected from the PlantTFDB database, and their differential expression levels were determined through transcriptome data.A total of 720 differentially expressed transcription factors were obtained, which were divided into 52 categories.The most abundant TF families mainly included bHLH, ERF, MYB, NAC, and bZIP (Figure S4).

Discussion
Lipid synthesis in plants is typically accomplished through multiple pathways and cell types.Although the fatty acid synthesis pathway has been relatively well elucidated, further investigation is still required to understand the genetic factors that regulate lipid metabolism pathways.Genome-wide association studies (GWAS) have emerged as a prominent approach for elucidating the genetic loci associated with agronomic traits in crops [36,37].Currently, the QTLs associated with significant traits, such as branch number [38], seed size [39], flowering time [40], and stress resistance [41], have been successfully identified.Compared to traditional QTL mapping, GWAS detect more genetic loci due to abundant molecular markers and larger sample sizes.
The continuous advancement of omics data has revealed that single-omics analysis may introduce bias in the investigation of certain plant traits, and it presents limitations in elucidating plant regulatory mechanisms.Multi-omics technology has been widely used in screening important markers of target traits and mining the related candidate genes.A total of 52 SNPs were identified and found to be correlated with four chlorophyll fluorescence parameters through transcriptome and GWAS' analysis.Additionally, RNA-seq analysis led to the screening of three candidate genes in important genomic regions [44].Song et al. used transcriptome and GWAS to analyze soybean seed coat color and found that 182 differentially expressed genes (DEGs) were screened out from five QTLs, including CHS, MYB, and F3 ′ H genes [45].In this study, a total of 22 potential candidate genes (|log2FC| ≥ 3) were identified based on transcriptome methods and GWAS (Table 2).Haplotype analysis revealed that three candidate genes, Glyma.03G186200,Glyma.09G099500, and Glyma.18G248900, had excellent haplotypes (Figure 4).One SNP was identified in the UTR region in the Glyma.03G186200gene (−log10(p) > 2.5), where the oil content of the A allele was significantly higher than that of the G allele (Figure 4A).The Glyma.03G186200 gene encodes the RAB GTPase homolog C2A, and previous studies have demonstrated that the small GTPase ARL8B gene is involved in mediating lipid droplet transformation [34].RabC1 has been identified as a crucial regulator essential for the modulation of lipid droplet dynamics and lipid metabolism.The findings demonstrate that RabC1 is capable of interacting with SEIPIN2 and SEIPIN3, both localized in the endoplasmic reticulum to modulate the mobilization of lipid droplets and ensure adequate lipid availability [46].Previous studies demonstrated the pivotal role of RabC1 GTPase in regulating Arabidopsis growth and seed development [47].A total of eight SNPs were identified in the Glyma.09G099500gene (−log10(p) > 2.5) across the upstream, UTR5, exonic, and UTR3 regions.Among these variants, the oil content of the C/T/T/T/T/G/C/A alleles was significantly higher than that of the T/A/C/G/C/T/A/G alleles (Figure 4B).The Glyma.09G099500 gene encodes a cation efflux family protein.The expression level of the MTP8 gene in Arabidopsis exhibits a continuous increase during seed development [48].Furthermore, one SNP was detected in the exonic region of the Glyma.18G248900gene across the three environments (−log10(p) > 2.5).The oil content associated with allele C in environments E1 and E3 was significantly higher than that associated with allele T, while no significant difference in oil content was observed between alleles C and T in environment E2 (Figure 4C).The Glyma.18G248900 gene encodes an unknown protein.Furthermore, it was observed that Glyma-18G248900 exhibits a significant positive association with bZIP, C3H, and Dof.Overexpression of the soybean bZIP transcription factor (GmbZIP123) leads to an increase in oil content in transgenic Arabidopsis seeds [49].Previous studies demonstrated that the overexpression of GhDof1 leads to an increase in the oil content of upland cotton [50].This study postulates that the regulation of the Glyma18G248900 gene and the subsequent changes in soybean oil content may be mediated by bZIP and Dof transcription factors.Meanwhile, it was found that Glyma.09G099500 is significantly positively correlated with bHLH, MYB, and GRAS.There have been reports suggesting that MYB1 plays a crucial role in the induction of FAT1 expression and facilitates the efficient transport of fatty acids from chloroplasts, which represents a pivotal step in lipid biosynthesis within the endoplasmic reticulum [51].

Plant Materials
In this study, 155 soybean germplasms were used as experimental materials (Table S4).All materials were planted in three designated test locations, namely, Nenjiang (124 • 44 ′ N, 48 • 42 ′ E), Beian (47 • 35 ′ N, 126 • 16 ′ E), and the five connected lakes area (48 • 18 ′ N, 126 • 04 ′ E).Field germplasm was planted with a row length of 2 m, a spacing between rows of 0.6 m, and a density of 30 plants per row.Fifteen soybean plants at the mature stage were randomly selected for the determination of oil content.An Infratec 1241 NIR grain analyzer (FOSS, Hoganas, Sweden) was utilized for quantifying the soybean oil content.

Germplasm Population Genotype Analysis
The genomic DNA of each leaf sample was extracted using the CTAB method.The specific-locus amplified fragment sequencing (SLAF-seq) technique was employed for the detection of amplified fragment sequencing in 155 soybean germplasms.The test samples were subjected to digestion using restriction endonucleases (Mse I and Hae III), resulting in the generation of fragments ranging from 300 bp to 500 bp in length.The barcode method and an Illumina Genome Analyzer II System (Illumina Inc., San Diego, CA, USA) were utilized to generate 45 bp sequence reads at both ends of the sequencing tags from each accession library.The alignment of the acquired raw paired-end reads to the reference genome (Glycine max Wm82.a2.v1) was conducted utilizing BWA software (Version: 0.6.1-r104).For subsequent association analysis, a total of 23,131 SNPs with a minimum allele frequency (MAF) ≥ 5% and deletion rate ≤ 10% were selected using GATK (version 2.4-7-g5e89f01) and SAMtools (Version: 0.1.18).

Population Structure Evaluation, Linkage Disequilibrium, and Genome-Wide Association Analysis
The oil content of 155 soybean materials was analyzed using a compressed mixed linear model (CMLM) in the GAPIT package, employing 23,131 single-nucleotide polymorphisms (SNPs).Significant SNP loci were screened with −log10(p) > 3 as the threshold.The first three principal component analyses (PCAs) were included as covariates in the subsequent analysis.

Transcriptome Sequencing Analysis
Total RNA extraction from soybean grain at the R6 stage was performed using TRIzol reagent (Invitrogen, Carlsbad, CA, USA).The eukaryotic mRNA was selectively enriched using magnetic beads conjugated with Oligo (dT) following the assessment of total RNA quality in the samples.Double-stranded cDNA was synthesized using mRNA as a template and six-base random primers, followed by purification and end repair of the cDNA.The size selection of cDNA fragments and accurate quantification of the effective concentration for library inspection were achieved using AMPure XP beads.The sequencing was conducted using the IlluminaHiSeq platform after meeting the qualification criteria.The raw reads obtained through sequencing must undergo quality control (QC) to eliminate low-quality reads and bases, resulting in the acquisition of high-quality clean reads.The soybean reference genome (glycine max Wm82.a2.v1) was utilized for sequence alignment.The differentially expressed genes were identified using thresholds of |log2Fold Change| > 1 and p < 0.05.Three biological replicates of each material were applied in this study.

Prediction of Candidate Genes
The 100 kb genomic region (upstream and downstream) of each significant SNP is defined as the putative candidate gene.The identification of putative candidate genes within the confidence interval was achieved by employing a combination of GWAS and transcriptome analysis, aiming to explore the regulatory mechanisms underlying oil content variation.A set of 28 soybean lines (14 extremely high oil and 14 extremely low oil) was derived from genome resequencing data to discern variations in candidate genes encompassing 5 ′ UTRs, 3 ′ UTRs, exons, and promoter regions.A general linear model (GLM) implemented in TASSEL 5.0 software was employed for haplotype analysis of candidate genes.

Co-Expression Analysis
The online plantTFDB database was utilized for the screening of soybean transcription factors.The differentially expressed transcription factors were selected for correlation analysis with potential candidate genes, and those with a Pearson correlation coefficient threshold (r > 0.98, p < 0.05) were selected.The visualization of the co-expression network was generated using Cytoscape 3.10.1 software.

Quantitative Real-Time PCR
Differential candidate genes were screened and analyzed by quantitative real-time PCR.The quantitative real-time PCR was conducted using a SYBR Green Realtime PCR Master Mix Kit (TOYOBO, Osaka, Japan).The relative expression level was determined using the 2 −∆∆ct method.Three biological replicates and three technical replicates were applied in this study.GmACTIN4 was used as internal control.All primers of qRT-PCR were generated in Table S5.

Conclusions
In this study, a total of 155 soybean materials were utilized.A total of 36 QTLs were found to be significantly correlated with oil content by GWAS analysis.Through the integrated GWAS and transcriptome analysis, 22 potential candidate genes were identified in this study.Haplotype analysis revealed that three candidate genes, Glyma.03G186200,Glyma.09G099500, and Glyma.18G248900, had excellent haplotypes.Furthermore, co-expression analysis revealed a significant correlation between Glyma.09G099500 and Glyma.18G248900 with bHLH, bZIP, MYB, and Dof transcription factors.

Figure 5 .
Figure 5. Co-expression network analysis of candidate genes and transcription factors.

Table 1 .
Genetic overlap between the selected QTLs and known QTLs.

Table 2 .
Candidate differential genes were identified by transcriptome and GWAS.